Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add new component CSVDocumentSplitter to recursively split CSV documents #8815

Merged
merged 24 commits into from
Feb 10, 2025

Conversation

sjrl
Copy link
Contributor

@sjrl sjrl commented Feb 5, 2025

Related Issues

Proposed Changes:

Alternative approach as discussed in this PR: #8795

How did you test it?

Added unit tests.

Notes for the reviewer

Checklist

  • I have read the contributors guidelines and the code of conduct
  • I have updated the related issue with new insights and changes
  • I added unit tests and updated the docstrings
  • I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test: and added ! in case the PR includes breaking changes.
  • I documented my code
  • I ran pre-commit hooks and fixed any issue

@coveralls
Copy link
Collaborator

coveralls commented Feb 5, 2025

Pull Request Test Coverage Report for Build 13239732856

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.05%) to 92.299%

Totals Coverage Status
Change from base Build 13237729498: 0.05%
Covered Lines: 9288
Relevant Lines: 10063

💛 - Coveralls

@sjrl sjrl marked this pull request as ready for review February 7, 2025 13:30
@sjrl sjrl requested review from a team as code owners February 7, 2025 13:30
@sjrl sjrl requested review from dfokina and davidsbatista and removed request for a team February 7, 2025 13:30
@alex-stoica
Copy link

alex-stoica commented Feb 9, 2025

Hey @sjrl I've tested your implementation a little bit

Performance:
When running the tests test_large_standard_case and test_large_side_by_side_case, I observed the following timings:

14.66s call     test/components/preprocessors/test_csv_document_splitter.py::TestCSVDocumentSplitterV2::test_large_side_by_side_case
14.48s call     test/components/preprocessors/test_csv_document_splitter.py::TestCSVDocumentSplitterV2::test_large_standard_case

which, on average, is a 2x improvement in terms of speed vs the previous BFS approach. 💪 good job

Correctness
Consider the test case test_complex_bridging

    def test_complex_bridging(self) -> None:
        """
        Rows bridging from left to right => BFS splits each row into left block & right block.
        """
        csv_data = """ID,LeftVal,,,RightVal,Extra
1,Hello,,,World,Joined
2,StillLeft,,,StillRight,Bridge

A,B,,,C,D
E,F,,,G,H
"""
        splitter = CSVDocumentSplitterV2(row_split_threshold=1, column_split_threshold=1)
        result = splitter.run([Document(content=csv_data)])
        docs = result["documents"]
        assert len(docs) == 4
        block_texts = [doc.content for doc in docs]
        assert any("ID,LeftVal" in text for text in block_texts)
        assert any("Hello" in text for text in block_texts)
        assert any("World,Joined" in text for text in block_texts)
        assert any("StillLeft" in text for text in block_texts)
        assert any("StillRight,Bridge" in text for text in block_texts)
        assert any("A,B" in text for text in block_texts)
        assert any("C,D" in text for text in block_texts)
        assert any("E,F" in text for text in block_texts)
        assert any("G,H" in text for text in block_texts)

In this scenario, the expected output should yield 4 documents rather than 2. skip_blank_lines in pandas is default True and this causes the problem. The same issue occurs for the test test_empty_rows_and_side_tables

Usability
Overall, I find your implementation to be more intuitive and easier to understand

@sjrl
Copy link
Contributor Author

sjrl commented Feb 10, 2025

Hey @sjrl I've tested your implementation a little bit

Performance: When running the tests test_large_standard_case and test_large_side_by_side_case, I observed the following timings:

14.66s call     test/components/preprocessors/test_csv_document_splitter.py::TestCSVDocumentSplitterV2::test_large_side_by_side_case
14.48s call     test/components/preprocessors/test_csv_document_splitter.py::TestCSVDocumentSplitterV2::test_large_standard_case

which, on average, is a 2x improvement in terms of speed vs the previous BFS approach. 💪 good job

Correctness Consider the test case test_complex_bridging

    def test_complex_bridging(self) -> None:
        """
        Rows bridging from left to right => BFS splits each row into left block & right block.
        """
        csv_data = """ID,LeftVal,,,RightVal,Extra
1,Hello,,,World,Joined
2,StillLeft,,,StillRight,Bridge

A,B,,,C,D
E,F,,,G,H
"""
        splitter = CSVDocumentSplitterV2(row_split_threshold=1, column_split_threshold=1)
        result = splitter.run([Document(content=csv_data)])
        docs = result["documents"]
        assert len(docs) == 4
        block_texts = [doc.content for doc in docs]
        assert any("ID,LeftVal" in text for text in block_texts)
        assert any("Hello" in text for text in block_texts)
        assert any("World,Joined" in text for text in block_texts)
        assert any("StillLeft" in text for text in block_texts)
        assert any("StillRight,Bridge" in text for text in block_texts)
        assert any("A,B" in text for text in block_texts)
        assert any("C,D" in text for text in block_texts)
        assert any("E,F" in text for text in block_texts)
        assert any("G,H" in text for text in block_texts)

In this scenario, the expected output should yield 4 documents rather than 2. skip_blank_lines in pandas is default True and this causes the problem. The same issue occurs for the test test_empty_rows_and_side_tables

Usability Overall, I find your implementation to be more intuitive and easier to understand

Thanks for taking a look @alex-stoica! I’ll make the changes you suggest and add the test cases for this.

@sjrl sjrl removed the topic:core label Feb 10, 2025
Copy link
Contributor

@davidsbatista davidsbatista left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - we should now also have an issue to keep track the documentation for this new component

@sjrl sjrl merged commit f9e6e48 into main Feb 10, 2025
19 checks passed
@sjrl sjrl deleted the recursive-csv-splitter branch February 10, 2025 17:10
@sjrl
Copy link
Contributor Author

sjrl commented Feb 10, 2025

LGTM - we should now also have an issue to keep track the documentation for this new component

Opened issue here: #8835

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic:tests type:documentation Improvements on the docs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Create a CSV Document splitter
4 participants